Visual Summary of Egocentric Photostreams by Representative Keyframes
Building a visual summary from an egocentric photostream captured by a
lifelogging wearable camera is of high interest for different applications
(e.g. memory reinforcement). In this paper, we propose a new summarization
method based on keyframe selection that uses visual features extracted by
means of a convolutional neural network. Our method applies unsupervised
clustering to divide the photostream into events, and then extracts the most
relevant keyframe for each event. We evaluate the results through a blind
taste test in which 20 people rated the quality of the summaries.
Comment: Paper accepted at the IEEE First International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX). Turin, Italy. July 3, 2015.
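Since the abstract outlines the pipeline (CNN features, event clustering, one keyframe per event), a minimal sketch may help. The clustering algorithm, cluster count, and centroid-distance criterion below are illustrative assumptions, not necessarily the paper's exact method.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def select_keyframes(features, n_events=10):
    """Cluster per-frame CNN features into events, then keep the frame
    closest to each event centroid. `n_events` is a placeholder value."""
    labels = AgglomerativeClustering(n_clusters=n_events).fit_predict(features)
    keyframes = []
    for event in range(n_events):
        idx = np.where(labels == event)[0]
        centroid = features[idx].mean(axis=0)
        dists = np.linalg.norm(features[idx] - centroid, axis=1)
        keyframes.append(int(idx[np.argmin(dists)]))
    return sorted(keyframes)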
Simple vs complex temporal recurrences for video saliency prediction
This paper investigates modifying an existing neural network architecture for static saliency prediction using two types of recurrences that integrate information from the temporal domain. The first modification is the addition of a ConvLSTM within the architecture, while the second is a conceptually simple exponential moving average of an internal convolutional state. We use weights pre-trained on the SALICON dataset and fine-tune our model on DHF1K. Our results show that both modifications achieve state-of-the-art results and produce similar saliency maps. Source code is available at https://git.io/fjPiB
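The second recurrence is simple enough to sketch. Below is a minimal PyTorch illustration of an exponential moving average over a sequence of internal convolutional states; the smoothing factor and tensor layout are assumptions, not the paper's settings.

import torch

def ema_over_time(states, alpha=0.1):
    """Blend each convolutional state (T, C, H, W) with the running
    average of previous states; `alpha` is a hypothetical value."""
    smoothed, running = [], states[0]
    for state in states:
        running = alpha * state + (1 - alpha) * running
        smoothed.append(running)
    return torch.stack(smoothed)

# Example: smooth 8 frames of 32-channel feature maps.
maps = ema_over_time(torch.rand(8, 32, 28, 28))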
Visual Information Retrieval in Endoscopic Video Archives
In endoscopic procedures, surgeons work with live video streams from the
inside of their subjects. A main source of documentation for these procedures
is still frames from the video, identified and captured during the surgery.
However, with growing demands and technical means, the streams are saved to
storage servers and the surgeons need to retrieve parts of the videos on
demand. In this submission we present a demo application for video retrieval
based on visual features and late fusion, which lets surgeons re-find shots
taken during the procedure.
Comment: Paper accepted at the IEEE/ACM 13th International Workshop on Content-Based Multimedia Indexing (CBMI) in Prague (Czech Republic), 10-12 June 2015.
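As an illustration of late fusion for retrieval, the sketch below ranks stored frames by a weighted sum of per-feature cosine similarities; the feature dictionaries and weights are hypothetical, not the demo's actual configuration.

import numpy as np

def late_fusion_rank(query, database, weights):
    """query: {feature_name: 1-D vector}; database: {feature_name:
    2-D matrix, one row per stored frame}; weights: {feature_name:
    float}. Returns frame indices, best match first."""
    n_frames = len(next(iter(database.values())))
    fused = np.zeros(n_frames)
    for name, w in weights.items():
        q = query[name] / np.linalg.norm(query[name])
        d = database[name] / np.linalg.norm(database[name], axis=1, keepdims=True)
        fused += w * (d @ q)          # cosine similarity per frame
    return np.argsort(-fused)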
Hyper-Representations for Pre-Training and Transfer Learning
Learning representations of neural network weights given a model zoo is an
emerging and challenging area with many potential applications, from model
inspection to neural architecture search or knowledge distillation. Recently,
an autoencoder trained on a model zoo was able to learn a hyper-representation,
which captures intrinsic and extrinsic properties of the models in the zoo. In
this work, we extend hyper-representations for generative use to sample new
model weights for pre-training. We propose layer-wise loss normalization,
which we demonstrate is key to generating high-performing models, and a
sampling method based on the empirical density of hyper-representations. The
models generated using our methods are diverse, performant and able to
outperform conventional baselines for transfer learning. Our results indicate
the potential of aggregating knowledge from model zoos into new models via
hyper-representations, thereby paving the way for novel research directions.
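A density-based sampler of this kind can be sketched with a kernel density estimate over the zoo's embeddings. Here `embeddings` and `decode` stand in for the paper's trained autoencoder components, and the choice of a Gaussian KDE is an assumption.

import numpy as np
from scipy.stats import gaussian_kde

def sample_new_weights(embeddings, decode, n_samples=5):
    """Fit a KDE to the hyper-representations of a model zoo
    (one row per model), draw new latent codes from that empirical
    density, and decode them into model weights."""
    kde = gaussian_kde(embeddings.T)     # scipy expects (dims, samples)
    latents = kde.resample(n_samples).T  # (n_samples, dims)
    return [decode(z) for z in latents]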
Improving Spatial Codification in Semantic Segmentation
This paper explores novel approaches for improving the spatial codification
for the pooling of local descriptors to solve the semantic segmentation
problem. We propose to partition the image into three regions for each object
to be described: Figure, Border and Ground. This partition aims at minimizing
the influence of the image context on the object description and vice versa by
introducing an intermediate zone around the object contour. Furthermore, we
also propose a richer visual descriptor of the object by applying a Spatial
Pyramid over the Figure region. Two novel Spatial Pyramid configurations are
explored: Cartesian-based and crown-based Spatial Pyramids. We test these
approaches with state-of-the-art techniques and show that they improve the
Figure-Ground based pooling in the Pascal VOC 2011 and 2012 semantic
segmentation challenges.
Comment: Paper accepted at the IEEE International Conference on Image Processing, ICIP 2015. Quebec City, 27-30 September. Project page: https://imatge.upc.edu/web/publications/improving-spatial-codification-semantic-segmentation
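The Figure/Border/Ground partition can be sketched with morphological operations on the object mask; the band width below is an illustrative assumption, not the paper's value.

import numpy as np
from scipy import ndimage

def figure_border_ground(mask, band_px=8):
    """Split an image into Figure, Border and Ground regions from a
    binary object mask by carving an intermediate band of `band_px`
    pixels around the object contour."""
    dilated = ndimage.binary_dilation(mask, iterations=band_px)
    eroded = ndimage.binary_erosion(mask, iterations=band_px)
    figure = eroded                 # safely inside the object
    border = dilated & ~eroded      # ambiguous zone around the contour
    ground = ~dilated               # safely outside the object
    return figure, border, ground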
Sign Language Translation from Instructional Videos
The advances in automatic sign language translation (SLT) to spoken languages
have been mostly benchmarked with datasets of limited size and restricted
domains. Our work advances the state of the art by providing the first baseline
results on How2Sign, a large and broad dataset.
We train a Transformer over I3D video features, using the reduced BLEU as a
reference metric for validation, instead of the widely used BLEU score. We
report a BLEU score of 8.03, and publish the first open-source implementation
of its kind to promote further advances.
Comment: Paper accepted at WiCV @ CVPR 2023.
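A bare-bones version of a Transformer over precomputed I3D features can be sketched as follows; the dimensions, vocabulary size, and the omission of positional encodings and target masking are simplifications for illustration, not the paper's configuration.

import torch
import torch.nn as nn

class I3DToText(nn.Module):
    """Translate precomputed I3D clip features into text tokens with a
    standard Transformer. Positional encodings and target masking are
    omitted for brevity; a real model needs both."""
    def __init__(self, i3d_dim=1024, d_model=256, vocab_size=5000):
        super().__init__()
        self.proj = nn.Linear(i3d_dim, d_model)      # I3D -> model width
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats, tgt_tokens):
        src = self.proj(feats)            # (B, T_src, d_model)
        tgt = self.embed(tgt_tokens)      # (B, T_tgt, d_model)
        return self.out(self.transformer(src, tgt))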
Temporal saliency adaptation in egocentric videos
This work adapts a deep neural model for image saliency
prediction to the temporal domain of egocentric video. We compute the
saliency map for each video frame, firstly with an off-the-shelf model
trained on static images, and secondly by adding convolutional or conv-LSTM
layers trained on a dataset for video saliency prediction. We study each
configuration on EgoMon, a new dataset made of seven egocentric videos
recorded by three subjects in both free-viewing and task-driven setups. Our
results indicate that the temporal adaptation is beneficial when the viewer
is not moving and observes the scene from a narrow field of view. Encouraged
by this observation, we compute and publish the saliency maps for the
EPIC-Kitchens dataset, in which viewers are cooking.
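To make the temporal adaptation concrete, here is a minimal recurrent convolutional head that refines per-frame saliency maps from a frozen static model. It is a simplified stand-in for the conv-LSTM described above, with illustrative sizes.

import torch
import torch.nn as nn

class RecurrentSaliencyHead(nn.Module):
    """Refine static per-frame saliency maps (T, 1, H, W) with a
    single convolutional recurrent cell over time."""
    def __init__(self, hidden=32):
        super().__init__()
        self.update = nn.Conv2d(1 + hidden, hidden, 3, padding=1)
        self.readout = nn.Conv2d(hidden, 1, 1)

    def forward(self, static_maps):
        h = static_maps.new_zeros(1, self.update.out_channels,
                                  *static_maps.shape[-2:])
        refined = []
        for s in static_maps:                       # iterate over frames
            h = torch.tanh(self.update(torch.cat([s[None], h], dim=1)))
            refined.append(torch.sigmoid(self.readout(h)))
        return torch.cat(refined)                   # (T, 1, H, W)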
Assessing knee OA severity with CNN attention-based end-to-end architectures
This work proposes a novel end-to-end convolutional neural network (CNN) architecture to automatically quantify the severity of knee osteoarthritis (OA) from X-ray images, incorporating trainable attention modules that act as unsupervised fine-grained detectors of the region of interest (ROI). The proposed attention modules can be applied at different levels and scales across any CNN pipeline, helping the network learn relevant attention patterns over the most informative parts of the image at different resolutions. We test the proposed attention mechanism on existing state-of-the-art CNN architectures as our base models, achieving promising results on the benchmark knee OA datasets from the Osteoarthritis Initiative (OAI) and the Multicenter Osteoarthritis Study (MOST). All code from our experiments will be publicly available in the GitHub repository: https://github.com/marc-gorriz/KneeOA-CNNAttention
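A minimal version of such a trainable attention module can be sketched as a 1x1 convolution that produces a normalized spatial map and gates the features; the sizes and placement are illustrative assumptions, not the paper's design.

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Reweight CNN feature locations with a learned softmax map,
    acting as an unsupervised detector of the region of interest."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats):                  # feats: (B, C, H, W)
        b, _, h, w = feats.shape
        attn = torch.softmax(self.score(feats).view(b, -1), dim=1)
        return feats * attn.view(b, 1, h, w)   # attention-gated features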